Recognition models to predict DNA-binding specificities of homeodomain proteins
Abstract
MOTIVATION: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence, or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors, the best available methods use k-nearest neighbor (KNN) algorithms, which predict a protein's specificity as the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes. An effective recognition model for this family would therefore facilitate predictive models of many transcriptional regulatory networks within these genomes.

RESULTS: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using an RF-based model.
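The KNN baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how protein similarity is measured, so plain sequence identity over pre-aligned sequences is assumed here, and the function names (`sequence_identity`, `knn_predict_pfm`) and the toy sequences and matrices are hypothetical.

```python
from typing import Dict, List

# One row per motif position; columns ordered A, C, G, T.
PFM = List[List[float]]


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical residues between two aligned, equal-length sequences.

    Assumption: identity stands in for whatever similarity measure a real
    KNN recognition model would use.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_predict_pfm(query: str, training: Dict[str, PFM], k: int = 3) -> PFM:
    """Average the PFMs of the k training proteins most similar to the query."""
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training proteins")
    # Rank training sequences from most to least similar to the query.
    ranked = sorted(training, key=lambda s: sequence_identity(query, s), reverse=True)
    neighbours = [training[s] for s in ranked[:k]]
    width = len(neighbours[0])
    # Element-wise mean over the k neighbouring matrices.
    return [
        [sum(m[pos][base] for m in neighbours) / k for base in range(4)]
        for pos in range(width)
    ]


# Toy data (hypothetical, not real homeodomain sequences or motifs).
toy_training: Dict[str, PFM] = {
    "KRQN": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "KRQK": [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "AAAA": [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
}
toy_prediction = knn_predict_pfm("KRQN", toy_training, k=2)
```

With `k=2`, the prediction averages the matrices of the two closest toy sequences and ignores the dissimilar third one. A learned model such as a random forest would instead be trained to map sequence features to base preferences at each position, which is the kind of approach the abstract reports as outperforming this baseline.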
Related articles
Analysis of Homeodomain Specificities Allows the Family-wide Prediction of Preferred Recognition Sites
We describe the comprehensive characterization of homeodomain DNA-binding specificities from a metazoan genome. The analysis of all 84 independent homeodomains from D. melanogaster reveals the breadth of DNA sequences that can be specified by this recognition motif. The majority of these factors can be organized into 11 different specificity groups, where the preferred recognition sequence betw...
Cooperative DNA-binding and sequence-recognition mechanism of aristaless and clawless.
To achieve accurate gene regulation, some homeodomain proteins bind cooperatively to DNA to increase those site specificities. We report a ternary complex structure containing two homeodomain proteins, aristaless (Al) and clawless (Cll), bound to DNA. Our results show that the extended conserved sequences of the Cll homeodomain are indispensable to cooperative DNA binding. In the Al-Cll-DNA com...
Covariation between homeodomain transcription factors and the shape of their DNA binding sites
Protein-DNA recognition is a critical component of gene regulatory processes but the underlying molecular mechanisms are not yet completely understood. Whereas the DNA binding preferences of transcription factors (TFs) are commonly described using nucleotide sequences, the 3D DNA structure is recognized by proteins and is crucial for achieving binding specificity. However, the ability to analyz...
Variation in Homeodomain DNA Binding Revealed by High-Resolution Analysis of Sequence Preferences
Most homeodomains are unique within a genome, yet many are highly conserved across vast evolutionary distances, implying strong selection on their precise DNA-binding specificities. We determined the binding preferences of the majority (168) of mouse homeodomains to all possible 8-base sequences, revealing rich and complex patterns of sequence specificity and showing that there are at least 65 ...
A Lexicon for Homeodomain-DNA Recognition
Decoding the cis-regulatory logic of eukaryotic genomes requires knowledge of the DNA-binding specificities of all transcription factors. New work (Berger et al., 2008; Noyes et al., 2008) provides individual specificities for nearly all Drosophila and mouse homeodomains, key DNA-binding domains in many transcription factors. The data underscore the complexity of determining target specificitie...